GitHub Repository: debakarr/machinelearning
Path: blob/master/Part 7 - Natural Language Processing/[R] Natural Language Processing.ipynb
Kernel: R

Natural Language Processing

Data Preprocessing

# Importing the dataset (quote = '' so quotation marks inside reviews are not treated as delimiters)
dataset_original = read.delim('Restaurant_Reviews.tsv', quote = '', stringsAsFactors = FALSE)
head(dataset_original, 10)
dim(dataset_original)

Cleaning the texts

# install.packages('tm')
library(tm)
corpus = VCorpus(VectorSource(dataset_original$Review))

# Lowercase each word
corpus = tm_map(corpus, content_transformer(tolower))
Loading required package: NLP
dataset_original$Review[1]
as.character(corpus[[1]])
# Removing all the numbers
corpus = tm_map(corpus, removeNumbers)
dataset_original$Review[29]
as.character(corpus[[29]])
# Removing all the punctuation
corpus = tm_map(corpus, removePunctuation)
dataset_original$Review[1]
as.character(corpus[[1]])
# Removing stopwords, e.g. 'the', 'a', 'an', 'in', 'on' (articles, prepositions, and other common words)
corpus = tm_map(corpus, removeWords, stopwords())
dataset_original$Review[1]
as.character(corpus[[1]])
# Stemming: reducing each word to its root form (e.g. 'loved' -> 'love')
# install.packages('SnowballC')
corpus = tm_map(corpus, stemDocument)
dataset_original$Review[1]
as.character(corpus[[1]])
# Removing extra whitespace, if any
# corpus = tm_map(corpus, stripWhitespace)
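For reuse, the whole cleaning pipeline above can be wrapped in one helper. This is a minimal sketch of our own (the name clean_corpus does not appear in the notebook); it assumes tm and SnowballC are installed as in the cells above:

# Bundle the cleaning steps above into a single function (our sketch)
clean_corpus = function(texts) {
  corpus = VCorpus(VectorSource(texts))
  corpus = tm_map(corpus, content_transformer(tolower))
  corpus = tm_map(corpus, removeNumbers)
  corpus = tm_map(corpus, removePunctuation)
  corpus = tm_map(corpus, removeWords, stopwords())
  corpus = tm_map(corpus, stemDocument)
  tm_map(corpus, stripWhitespace)
}

# Equivalent to the cells above:
# corpus = clean_corpus(dataset_original$Review)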

Creating the Bag of Words model

dtm = DocumentTermMatrix(corpus)
dim(dtm)
dtm
<<DocumentTermMatrix (documents: 1000, terms: 1577)>>
Non-/sparse entries: 5435/1571565
Sparsity           : 100%
Maximal term length: 32
Weighting          : term frequency (tf)
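Reading the summary: 1000 documents times 1577 terms gives 1,577,000 cells, of which only 5435 are non-zero, so about 99.7% of the matrix is empty; the printout rounds this to 100%. A quick check (our addition):

# Fraction of non-zero entries in the document-term matrix
5435 / (1000 * 1577)   # ~0.0034, i.e. ~99.7% of entries are zero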
# Filter out infrequent words: removeSparseTerms(dtm, 0.999) drops every term
# whose sparsity exceeds 0.999, i.e. any term appearing in only one review
dtm = removeSparseTerms(dtm, 0.999)
dtm
<<DocumentTermMatrix (documents: 1000, terms: 691)>>
Non-/sparse entries: 4549/686451
Sparsity           : 99%
Maximal term length: 12
Weighting          : term frequency (tf)
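The filter keeps 691 of the 1577 terms. As a sanity check (our addition), every surviving term should occur in more than 1000 * (1 - 0.999) = 1 review:

# Each remaining term should appear in at least two reviews
all(colSums(as.matrix(dtm) > 0) >= 2)   # expected: TRUE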
# Build the final dataset: one column per term, plus the dependent variable
dataset = as.data.frame(as.matrix(dtm))
dataset$Liked = dataset_original$Liked
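A quick shape check (our addition) explains the column indices used below: the data frame holds 691 term columns plus Liked, so Liked is column 692 and training_set[-692] drops only the label:

dim(dataset)   # expected: 1000  692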
# Encoding the target feature as factor
dataset$Liked = factor(dataset$Liked, levels = c(0, 1))

# Splitting the dataset into the Training set and Test set
# install.packages('caTools')
library(caTools)
set.seed(1234)
split = sample.split(dataset$Liked, SplitRatio = 0.80)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)

# Fitting Naive Bayes to the Training set
# install.packages('e1071')
library(e1071)
classifier = naiveBayes(x = training_set[-692],
                        y = training_set$Liked)

# Predicting the Test set results
y_pred = predict(classifier, newdata = test_set[-692])

# Making the Confusion Matrix
cm = table(test_set[, 692], y_pred)
cm
   y_pred
      0  1
  0   9 91
  1   7 93
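The notebook stops at the confusion matrix; as a sketch (our addition), accuracy can be read off it directly. Naive Bayes manages only about 51% here, barely above chance; one plausible reason is that e1071's naiveBayes models numeric predictors as Gaussians, a poor fit for these 0/1 word counts:

# Accuracy: correct predictions over all 200 test reviews
sum(diag(cm)) / sum(cm)   # (9 + 93) / 200 = 0.51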
# Encoding the target feature as factor
dataset$Liked = factor(dataset$Liked, levels = c(0, 1))

# Splitting the dataset into the Training set and Test set
# install.packages('caTools')
library(caTools)
set.seed(1234)
split = sample.split(dataset$Liked, SplitRatio = 0.80)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)

# Fitting Random Forest to the Training set
# install.packages('randomForest')
library(randomForest)
classifier = randomForest(x = training_set[-692],
                          y = training_set$Liked,
                          ntree = 10)

# Predicting the Test set results
y_pred = predict(classifier, newdata = test_set[-692])

# Making the Confusion Matrix
cm = table(test_set[, 692], y_pred)
randomForest 4.6-12
Type rfNews() to see new features/changes/bug fixes.
cm
   y_pred
      0  1
  0  76 24
  1  28 72
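The same computation (our addition) gives the Random Forest a clear edge over Naive Bayes on this split; raising ntree above 10 may improve the result further:

# Accuracy for the Random Forest classifier
sum(diag(cm)) / sum(cm)   # (76 + 72) / 200 = 0.74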